Acartia tonsa is a calanoid copepod that has a world wide distribution. It inhabits estuaries and coastal waters and it typically the most abundant zooplankton present. Because of their huge population sizes and global distribution, they play an important role in ocean biogeochemical cycling and ecosystem functioning. For example, they’re an important food base for many fishes. Given their broad distribution in dynamic habitats (estuaries), they can tolerate and live in a broad range of salinities, freshwater to seawater, and temperatures, among other stressors.
We know that the world is rapidly changing due to human activities and it is important to understand how species will respond to these changes. We were interested in understanding how A. tonsa could respond to stressful environments that will likely be caused by climate change. Can they adapt to rapid changes in temperature and acidity? How might epigenetic responses interact with adaptation to enable resilience?
A. tonsa is a great system to ask these questions for a number of reasons. First, their generation time is only ~10-20 days and they’re tiny! Only 1-2 mm. This means we can easily execute multi-generational studies with thousands of individuals. Second, because they can tolerate such variable environments, they should have lots of plasticity to respond to environmental fluctuations. Finally, their large population sizes mean that genetic diversity should be high, giving them high adaptive potential. This means that if any species can respond to rapid environmental change, it is likely A. tonsa.
A. tonsa were collected from the wild, common gardened for three generations, and then split into four treatments with four replicates each and about 3,000-5,000 individuals per replicate. They were left to evolve in these conditions for 25 generations. Samples were collected at generation F0 and F25 to quantify methylation frequencies using reduced representation bisulfite sequencing (RRBS).
The experimental design
| Sample ID | Treatment | Trt abrreviation | Generation |
|---|---|---|---|
| AA_F00_1 | Control: ambient temp, ambient CO2 | AA | F00 |
| AA_F25_2 | Control: ambient temp, ambient CO2 | AA | F25 |
| AH_F25_3 | ambient temp, high CO2 | AH | F25 |
| HA_F25_1 | high temp, ambient CO2 | HA | F25 |
| HH_F25_1 | high temp, high CO2 | HH | F25 |
Following the adapter ligation, we bisulfite convert all unmethylated C’s to T’s.
Before starting we also spiked in a small amount of DNA from E. coli that we know wasn’t methylated. Using this, we can calculate downstream how efficient our bisulfite conversion was.
How RRBS works
Flowchart for capabilities of methylKit package
Unlike the normal sequence reads, in bisulfite conversion there is a high no. of T bases. This increase is because of the fact that all the non-methylated cytosines are converted to thyamine bases. While the that the reverse strand is the reverse complement of the bisulfite converted forward strand, and is not just the BS converted bottom strand. Thus reverse read being complement to the forward read will have a higher no. of adenine bases.
Why bisulfite fastq visualization looks weird
FastQ for the bisulfite convereted forward read after trimming
FastQ for the bisulfite convereted reverse read after trimming
Sample assigned
- AH_F25_4
Bowtie 1 (Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25), or alternatively Bowtie 2, and therefore it is a requirement that Bowtie 1 (or Bowtie 2) are also installed on your machine (see Dependencies).All files associated with Bismark as well as a test BS-Seq data set can be downloaded from: http://www.bioinformatics.babraham.ac.uk/projects/bismark/
USAGE: bismark [options] <genome_folder> {-1 <mates1> -2 <mates2> | <singles>}
ARGUMENTS:
<genome_folder>
--genome_folder /path/to/genome/folder/.-1 <mates1>
-2 <mates2>
<singles>
type bismark in bash to start the program
bismark -help for help with the wrapper
--output_dir - one output directory for all the data
--rg_tag--rg_id - sample id--gzip - compress--local - local alignment / soft clipping of the bases--maxins - max insert size
#!/bin/bash
bismark --bowtie2 --multicore 1 \
--genome /data/project_data/epigenetics/reference_genome \
--output_dir /data/project_data/epigenetics/aligned_output \
-1 /data/project_data/epigenetics/trimmed_fastq/AH_F25_4_1.fq.gz \
-2 /data/project_data/epigenetics/trimmed_fastq/AH_F25_4_2.fq.gz \
--rg_tag --rg_id AH_F25_4 --rg_sample AH_F25_4 --gzip --local --maxins 1000
Bismark is, at its core, running bowtie2. But there are important differences. If you remember, we’re using two modified versions of the genome where we’ve converted C-to-T AND G-to-A. We also generate temporary files of our trimmed reads where we convert C-to-T (read1) AND G-to-A (read2) so they can map to this converted genome. This makes the alignment a bit harder because : 1. the complexity of DNA is reduced
2. we are mapping to two separate genomes.
Parameters
--bowtie2 tells bismark to map with bowtie2. There are other options possible here (bowtie1, for example)
--multicore 1 Use one core. We could use multiple cores to make alignment faster, but we may crash the server if many of you are mapping at once if we do this.
--genome the location of our converted genome. The bismark package has a command to prep your genome that I’ve already done. You can check it out: bismark_methylation_extractor --help
-1 the path to your sample’s forward read
-2 the path to your sample’s reverse read
--rg_tag add a read group tag that identifies your individual sample in your output bam file
--rg_id the string that defines the readgroup ID
--rg_sample the string that defines your readgroup sample ID
--gzip for the temporary files, use gzip to save space
--local align with the local alignment option in bowtie2. This will include soft clipping, which should increase mapping rate, but comes at the cost of (maybe) increasing mis-mapping.
--maxins specifies the maximum insert size for mapping. We’re using 1000 as our libraries had a mean insert size of ~200. Note that this is smaller than most sequencing libraries.
#!/bin/bash
bismark --bowtie2 --multicore 1 \
--genome /data/project_data/epigenetics/reference_genome \
--output_dir /data/project_data/epigenetics/aligned_output \
-1 /data/project_data/epigenetics/trimmed_fastq/SAMPLEID_1.fq.gz \
-2 /data/project_data/epigenetics/trimmed_fastq/SAMPLEID_2.fq.gz \
--rg_tag --rg_id SAMPLEID --rg_sample SAMPLEID --gzip --local --maxins 1000
Here the SAMPLEID was replaced with my sample id AH_F25_4
On Mac, use the following: zcat < AA_F00_1_1_bismark_bt2_pe.bismark.cov.gz
Mapping rate
- The mapping rate is very high in AA_F00, the control at zero generations, which can be worrying. This might be due to the fact that slightly more samples went into sequecing AA_F00.
- So including more DNAs might actually help increase the mapping rate.
- Maybe mapping them as single ends rather than paired end might help increase the mapping rate.
- THe bisulfite conversion rate was checked and found out to be 98-99%, when compared with the E.coli that was spiked in.
Mapping rates for different treatements
bismark_methylation_extractor
USAGE: bismark_methylation_extractor [options] <filenames>
A typical command to extract context-dependent (CpG/CHG/CHH) methylation could look like this:
bismark_methylation_extractor -s –comprehensive
test_dataset.fastq_bismark.sam
This will produce three output files:
(a) CpG_context_test_dataset.fastq_bismark.txt
(b) CHG_context_test_dataset.fastq_bismark.txt
(c) CHH_context_test_dataset.fastq_bismark.txt
bismark_methylation_extractor --bedGraph --scaffolds --gzip \
--cytosine_report --comprehensive \
--no_header \
--parallel 6 \
--output ~/tonsa_epigenetics/analysis/methylation_extract/ \
--genome_folder /data/copepods/tonsa_genome/ \
*pe.bam
M-Bias Plot
- Methylation Bias Plot
- CpG methylation is the dark line across the length of the read, and we are intersted in this line
- CpG methylation is same across the read, which is what we expect to see
- CpG reads are very high at the begining for read 1, we are unsure of what it is so we are going to cut the first two bases to remove this bias.
- It is generally recommended to remove this bias
- The same spike was also seen in the spiked E.coli
- The mechanism that is causing it is very unclear
- Generally considered as a technical artifact rather than a biological artifact
- Read 2 shows a bit of a dip in CpG at the beginning
To ignore the first two bases:
- Done to remove the high CpG read at the begining of the reads
bismark_methylation_extractor --bedGraph --scaffolds --gzip \
--cytosine_report --comprehensive \
--no_header \
--parallel 6 \
--ignore 2 --ignore_r2 2 \
# this says to ignore the first two bases
--output ~/tonsa_epigenetics/analysis/methylation_extract/ \
--genome_folder /data/copepods/tonsa_genome/ \
*pe.bam
We can look at sites with some data using the following:
zcat HH_F25_4_1_bismark_bt2_pe.bismark.cov.gz | head
Column 1: Chromosome
Column 2: Start position
Column 3: End position
- Both start and stop column of the base we are intersted in are going to be same, since we are looking at single bases
Column 4: Methylation percentage
Column 5: Number of methylated C’s
Column 6: Number of unmethylated C’s
zcat HH_F25_4_1_bismark_bt2_pe.bismark.cov.gz | awk '$4 > 5 && $5 > 10' | head
says column 4 values > 5
says column 5 > ten methylated reads
Packages required:
require(tidyverse)
require(pheatmap)
# fixing issues with methylKit package installation
# BiocManager::install("methylKit")
# install.packages("tibble", dependencies = TRUE, INSTALL_opts = '--no-lock')
require(methylKit)
require(plotly)
require(patchwork)
# first, we want to read in the raw methylation calls with methylkit
# set directory with absolute path (why is this necessary? I have no idea, but gz files wont work with relative paths)
dir <- "C:\\Epigenetics_data\\Epigenetics_data\\"
list.files(dir)
## [1] "AA_F00_1_1_bismark_bt2_pe.bismark.cov.gz"
## [2] "AA_F00_2_1_bismark_bt2_pe.bismark.cov.gz"
## [3] "AA_F00_3_1_bismark_bt2_pe.bismark.cov.gz"
## [4] "AA_F00_4_1_bismark_bt2_pe.bismark.cov.gz"
## [5] "AA_F25_1_1_bismark_bt2_pe.bismark.cov.gz"
## [6] "AA_F25_2_1_bismark_bt2_pe.bismark.cov.gz"
## [7] "AA_F25_3_1_bismark_bt2_pe.bismark.cov.gz"
## [8] "AA_F25_4_1_bismark_bt2_pe.bismark.cov.gz"
## [9] "AH_F25_1_1_bismark_bt2_pe.bismark.cov.gz"
## [10] "AH_F25_2_1_bismark_bt2_pe.bismark.cov.gz"
## [11] "AH_F25_3_1_bismark_bt2_pe.bismark.cov.gz"
## [12] "AH_F25_4_1_bismark_bt2_pe.bismark.cov.gz"
## [13] "bed_files"
## [14] "checklist.chk"
## [15] "HA_F25_1_1_bismark_bt2_pe.bismark.cov.gz"
## [16] "HA_F25_2_1_bismark_bt2_pe.bismark.cov.gz"
## [17] "HA_F25_3_1_bismark_bt2_pe.bismark.cov.gz"
## [18] "HA_F25_4_1_bismark_bt2_pe.bismark.cov.gz"
## [19] "HH_F25_1_1_bismark_bt2_pe.bismark.cov.gz"
## [20] "HH_F25_2_1_bismark_bt2_pe.bismark.cov.gz"
## [21] "HH_F25_3_1_bismark_bt2_pe.bismark.cov.gz"
## [22] "HH_F25_4_1_bismark_bt2_pe.bismark.cov.gz"
## [23] "methylBase_united.txt.bgz"
## [24] "methylBase_united.txt.bgz.tbi"
## [25] "sample_id.txt"
# read in the sample ids
samples <- read.table("C:\\Epigenetics_data\\Epigenetics_data\\sample_id.txt",
header = FALSE)
# now point to coverage files
files <- file.path(dir, samples$V1) # pasting the directory and samples together
all(file.exists(files)) # to check if the path is real and points to the correct location
## [1] TRUE
# convert to list
file.list <- as.list(files) # for methylKit to run properly
# get the names only for naming our samples
nmlist <- as.list(gsub("_1_bismark_bt2_pe.bismark.cov.gz","",samples$V1))
head(nmlist)
## [[1]]
## [1] "AA_F00_1"
##
## [[2]]
## [1] "AA_F00_2"
##
## [[3]]
## [1] "AA_F00_3"
##
## [[4]]
## [1] "AA_F00_4"
##
## [[5]]
## [1] "AA_F25_1"
##
## [[6]]
## [1] "AA_F25_2"
# use methRead to read in the coverage files
## converting to a methylKit object from the list made
myobj <- methRead(location = file.list, # path to the files stored
sample.id = nmlist, # sample ID info
assembly = "atonsa", # this is just a string. no actual database
dbtype = "tabix", # going to make a database in the form of .txt files
# this is also good for saving up on the memory
context = "CpG",
resolution = "base", # resolution is for SNPs so at a base elevel
mincov = 20, # min 20 reads per site to get accurate estimation
# 10 is an acceptable value too
treatment = c(0,0,0,0, # four AA_F00 files
1,1,1,1, # four AA_F25 files
2,2,2,2, # four AH_F25 files
3,3,3,3, # four HA_F25 files
4,4,4,4), # four HH_F25 files
pipeline = "bismarkCoverage", # input data format
dbdir = "C:\\Epigenetics_data\\") # not necessary to be specified
######
# visualize coverage and filter
######
# We can look at the coverage for individual samples with getCoverageStats()
getCoverageStats(myobj[[1]], plot = TRUE, both.strands = FALSE) #Get CpG coverage information
# numbers on bars denote what percentage of locations are contained in that bin
# Experiments that are highly suffering from PCR duplication bias will have
# a secondary peak towards the right hand side of the histogram
myobj[[1]] # no. of Cs and Ts should equal coverage
## methylRawDB object with 67815 rows
## --------------
## chr start end strand coverage numCs numTs
## 1 LS041577.1 13445 13445 * 24 0 24
## 2 LS041577.1 13454 13454 * 24 1 23
## 3 LS041577.1 13457 13457 * 24 0 24
## 4 LS041577.1 13469 13469 * 24 0 24
## 5 LS041577.1 13472 13472 * 24 0 24
## 6 LS041577.1 13490 13490 * 23 0 23
## --------------
## sample.id: AA_F00_1
## assembly: atonsa
## context: CpG
## resolution: base
## dbtype: tabix
# looking at percent methylation
getMethylationStats(myobj[[1]],plot=TRUE,both.strands=FALSE)
# and can plot all of our samples at once to compare.
# par(mfrow=c(5,4))
#
# for(i in 1:length(myobj)) {
# getCoverageStats(myobj[[i]], plot = TRUE, both.strands = FALSE) #Get CpG coverage information
# } #Plot and save %CpG methylation information
#
# par(mfrow=c(1,1))
The histogram of CpG coverage for all the samples
# filter samples by depth with filterByCoverage()
filtered.myobj <- filterByCoverage(myobj,
lo.count = 20, lo.perc = NULL,
hi.count = NULL, hi.perc = 97.5,
db.dir = "C:\\Epigenetics_data\\")
######
# merge samples
######
#Note! This takes a while and we're skipping it
# use unite() to merge all the samples. We will require sites to be present in each sample or else will drop it
# don't run as it takes a lot of time
# meth <- unite(filtered.myobj,
# mc,cores=3,
# suffix="united", # add united at the end of the name
# db.dir="C:\\Epigenetics_data\\")
Methylkit has a convenient aspect where we can load previously generated databases. This means we don’t have to re-run analyses, but can skip ahead to where we previously left off. We will do this below to load the united database that I’ve already generated.
meth <- methylKit:::readMethylBaseDB(
dbpath = "C:\\Epigenetics_data\\Epigenetics_data\\methylBase_united.txt.bgz",
dbtype = "tabix",
sample.id = unlist(nmlist),
assembly = "atonsa", # this is just a string. not an actual database
context = "CpG",
resolution = "base",
treatment = c(0,0,0,0,
1,1,1,1,
2,2,2,2,
3,3,3,3,
4,4,4,4),
destrand = FALSE)
# percMethylation() calculates the percent methylation for each site and sample
pm <- percMethylation(meth)
head(pm) # methylation percentage for each site per each sample
## AA_F00_1 AA_F00_2 AA_F00_3 AA_F00_4 AA_F25_1 AA_F25_2 AA_F25_3
## [1,] 0.000000 5.000000 7.692308 2.564103 1.123596 3.0000000 3.9603960
## [2,] 4.166667 2.500000 0.000000 0.000000 10.112360 0.9900990 0.9708738
## [3,] 0.000000 2.564103 3.846154 0.000000 3.370787 0.9803922 0.9615385
## [4,] 0.000000 2.500000 3.846154 0.000000 1.086957 2.9411765 0.0000000
## [5,] 0.000000 2.500000 7.692308 0.000000 1.086957 2.9702970 0.0000000
## [6,] 0.000000 2.500000 3.846154 0.000000 2.150538 0.0000000 1.9417476
## AA_F25_4 AH_F25_1 AH_F25_2 AH_F25_3 AH_F25_4 HA_F25_1 HA_F25_2
## [1,] 2.8735632 3.6866359 2.2388060 2.5423729 4.1095890 9.375000 2.816901
## [2,] 1.1494253 0.0000000 3.6764706 0.0000000 1.3698630 3.125000 1.408451
## [3,] 1.1494253 0.9090909 0.7407407 0.8474576 2.7397260 0.000000 0.000000
## [4,] 3.9772727 2.2727273 0.7352941 0.0000000 0.6849315 0.000000 0.000000
## [5,] 0.5714286 0.4608295 2.2058824 1.6949153 2.0689655 0.000000 1.408451
## [6,] 3.4090909 1.8348624 0.7299270 0.8403361 1.3698630 4.477612 0.000000
## HA_F25_3 HA_F25_4 HH_F25_1 HH_F25_2 HH_F25_3 HH_F25_4
## [1,] 3.030303 2.439024 4.761905 0 1.470588 0.000000
## [2,] 0.000000 1.219512 1.190476 0 0.000000 2.083333
## [3,] 1.515152 1.219512 3.571429 0 1.470588 4.166667
## [4,] 0.000000 1.219512 1.190476 0 0.000000 0.000000
## [5,] 0.000000 0.000000 1.204819 0 5.882353 4.166667
## [6,] 2.898551 2.439024 0.000000 0 1.470588 2.083333
#plot methylation histograms
ggplot(gather(as.data.frame(pm)),
aes(value)) +
geom_histogram(bins = 10, color="black", fill="grey") +
facet_wrap(~key)
- none of these are fully methylated
- biological reason - pools of individuals, so if any slight change in one individual can affect the overall methylation levels for any individual site
- differenes can also be present because of differenes due to different cell types
- this U pattern is quite common in model organisms, with high methyaltion at zero and 100, and very low in between
- The reason could also be technical
# calculate and plot mean methylation
sp.means <- colMeans(pm)
sp.means
## AA_F00_1 AA_F00_2 AA_F00_3 AA_F00_4 AA_F25_1 AA_F25_2 AA_F25_3 AA_F25_4
## 41.09502 41.61804 41.50262 40.83576 36.30440 38.37196 39.92734 40.73100
## AH_F25_1 AH_F25_2 AH_F25_3 AH_F25_4 HA_F25_1 HA_F25_2 HA_F25_3 HA_F25_4
## 38.25740 36.03000 38.19356 39.57069 39.13568 40.87818 38.46266 38.78108
## HH_F25_1 HH_F25_2 HH_F25_3 HH_F25_4
## 37.59602 38.26307 38.42900 39.20101
dim(pm) # methlation for 15158 sites in 20 samples
## [1] 15158 20
p.df <- data.frame(sample=names(sp.means),
group = substr(names(sp.means), 1,6),
methylation = sp.means)
ggplot(p.df, aes(x=group, y=methylation, color=group)) +
stat_summary(color="black") + geom_jitter(width=0.1, size=3) +
ylab("Percent Methylation") + xlab("Treatment") + theme_bw() +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=15,face="bold"))
## No summary function supplied, defaulting to `mean_se()`
## is the high methylation rate high in AA_F00 or is it due to it having higher mapping rate compared to the rest of the samples?? - hard to know
PCA
# sample clustering
# looking for similarities and differences between the samples / treatments
clusterSamples(meth,
dist = "correlation",
method = "ward.D",
plot = TRUE)
##
## Call:
## hclust(d = d, method = HCLUST.METHODS[hclust.method])
##
## Cluster method : ward.D
## Distance : pearson
## Number of objects: 20
# they are not clustering by treatments
# F00 samples are clustering together but not so much in other treatments
# PCA
PCASamples(meth)
# similarly messy
find differentially methylated sites between two groups
# subset with reorganize()
meth_sub <- reorganize(meth,
sample.ids =c("AA_F00_1","AA_F00_2","AA_F00_3", "AA_F00_4",
"HH_F25_1","HH_F25_2","HH_F25_3","HH_F25_4"),
treatment = c(0,0,0,0,1,1,1,1), # subsetting to just 8 samples
save.db=FALSE)
# calculate differential methylation
myDiff <- calculateDiffMeth(meth_sub,
overdispersion = "MN",
mc.cores = 1,
suffix = "AA_HH",
adjust = "qvalue", # FDR correction - commonly applied to genomic analysis
test = "Chisq") # chi square
## two groups detected:
## will calculate methylation difference as the difference of
## treatment (group: 1) - control (group: 0)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# this model of overdispersion and test statistic seems to be the best
the methylation proportions are weighted by their coverage, as in a typical logistic regression. Note that in theory you could enter these as two column success and failure data frame, which is common in logistic regressions.
overdispersion with Chisq.
# get all differentially methylated bases
myDiff <- getMethylDiff(myDiff,
qvalue = 0.05, # taking out only the significant ones
difference = 10) # methylation differences that is taken to count as significant
# 10 is an arbitary number- it can range from 10 to 25%
# we can visualize the changes in methylation frequencies quickly.
head(getData(myDiff)$meth.diff)
## [1] -10.16196 -17.85524 -13.96676 -13.15037 10.08852 10.08054
hist(getData(myDiff)$meth.diff) # methylation differences
# hist before zero towards the left are hypomethylated and those towards the right are hypermethylated
# we see a lower methylation % in HH samples compared to AA saples
# get hyper methylated bases
hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")
hist(getData(hyper)$meth.diff)
# get hypo methylated bases
hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")
hist(getData(hypo)$meth.diff)
#heatmaps first
pm <- percMethylation(meth_sub)
# make a dataframe with snp id's, methylation, etc.
sig.in <- as.numeric(row.names(myDiff))
pm.sig <- pm[sig.in,]
# add snp, chr, start, stop
din <- getData(myDiff)[,1:3]
df.out <- cbind(paste(getData(myDiff)$chr,
getData(myDiff)$start, sep=":"),
din, pm.sig)
colnames(df.out) <- c("snp",
colnames(din),
colnames(df.out[5:ncol(df.out)]))
df.out <- (cbind(df.out,getData(myDiff)[,5:7]))
####
# heatmap
####
my_heatmap <- pheatmap(pm.sig,
show_rownames = FALSE) # red is higher methylation
# we can also normalize
ctrmean <- rowMeans(pm.sig[,1:4])
h.norm <- (pm.sig-ctrmean)
my_heatmap <- pheatmap(h.norm,
show_rownames = FALSE)
##### if you want to change colors. only because I don't love the default colors.
paletteLength <- 50
myColor <- colorRampPalette(c("cyan1", "black", "yellow1"))(paletteLength)
myBreaks <- c(seq(min(h.norm), 0, length.out=ceiling(paletteLength/2) + 1),
seq(max(h.norm)/paletteLength, max(h.norm), length.out=floor(paletteLength/2)))
my_heatmap <- pheatmap(h.norm,
color=myColor,
breaks=myBreaks,
show_rownames = FALSE)
#####
#let's look at methylation of specific snps
####
# convert data frame to long form
head(df.out,3)
## snp chr start end AA_F00_1 AA_F00_2 AA_F00_3 AA_F00_4
## 23 LS041600.1:376 LS041600.1 376 376 84.37500 83.97436 84.28571 83.57143
## 29 LS041600.1:475 LS041600.1 475 475 77.89474 75.00000 89.04110 80.55556
## 58 LS042377.1:90 LS042377.1 90 90 61.85286 59.91903 58.59155 58.15085
## HH_F25_1 HH_F25_2 HH_F25_3 HH_F25_4 pvalue qvalue meth.diff
## 23 76.66667 77.47748 70.17544 70.00000 6.463564e-04 0.0086976959 -10.16196
## 29 45.45455 66.03774 67.50000 68.62745 3.809827e-03 0.0291370853 -17.85524
## 58 53.57143 46.09929 40.42553 49.15254 4.128226e-06 0.0002458728 -13.96676
df.plot <- df.out[,c(1,5:12)] %>%
pivot_longer(-snp, values_to = "methylation")
df.plot$group <- substr(df.plot$name,1,2)
head(df.plot)
## # A tibble: 6 x 4
## snp name methylation group
## <fct> <chr> <dbl> <chr>
## 1 LS041600.1:376 AA_F00_1 84.4 AA
## 2 LS041600.1:376 AA_F00_2 84.0 AA
## 3 LS041600.1:376 AA_F00_3 84.3 AA
## 4 LS041600.1:376 AA_F00_4 83.6 AA
## 5 LS041600.1:376 HH_F25_1 76.7 HH
## 6 LS041600.1:376 HH_F25_2 77.5 HH
# looking at snp LS041600.1:376
df.plot %>% filter(snp=="LS041600.1:376") %>%
ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
stat_summary(fun.data = "mean_se", size = 2) +
geom_jitter(width = 0.1, size=3, pch=21, color="black")
## write bed file for intersection with genome annotation
write.table(file = "C:\\Epigenetics_data\\Epigenetics_data\\bed_files\\diffmeth.bed",
data.frame(chr= df.out$chr, start = df.out$start, end = df.out$end),
row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
code template
- first path –> the location of the package stored - bedtools
- closest - function in bedtools calling for closest features between samples
- more details here
- > hits.bed - assigning name to the output file
/data/popgen/bedtools2/bin/bedtools closest -a DIFF.bedFILELOCATION \
-b /data/project_data/epigenetics/GCA_900241095.1_Aton1.0_genomic.fa_annotation_table.bed \
-D b | \
awk '!($10=="-")' > hits.bed
#!/bin/bash
/data/popgen/bedtools2/bin/bedtools closest -a /users/a/p/ap1/EcologicalGenomics/myresults/bedfiles_copepod/diffmeth.bed\
-b /data/project_data/epigenetics/GCA_900241095.1_Aton1.0_genomic.fa_annotation_table.bed \
-D b | \
awk '!($10=="-")' > hits.bed # save filename is hits.bed
# the annotation file here has the format: ScaffoldName FromPosition ToPosition Sense TranscriptName TranscriptPath GeneAccession GeneName GeneAltNames GenePfam GeneGo CellularComponent MolecularFunction BiologicalProcess
# note that the hits.bed file will paste the diffmeth.bed file before the annotation table. So the first 3 columns are fom diffmeth.bed, then next 8 from the annotation table.
# count up number of hits
cat hits.bed | wc -l
# count number of unique named genes
cat hits.bed | cut -f 8 | sort | uniq -c
find differentially methylated sites between two groups
# subset with reorganize()
meth_sub <- reorganize(meth,
sample.ids =c("AA_F25_1","AA_F25_2","AA_F25_3", "AA_F25_4",
"AH_F25_1","AH_F25_2","AH_F25_3","AH_F25_4"),
treatment = c(0,0,0,0,1,1,1,1), # subsetting to just 8 samples
save.db=FALSE)
# calculate differential methylation
myDiff <- calculateDiffMeth(meth_sub,
overdispersion = "MN",
mc.cores = 1,
suffix = "AA_AH",
adjust = "qvalue",
test = "Chisq")
## two groups detected:
## will calculate methylation difference as the difference of
## treatment (group: 1) - control (group: 0)
# get all differentially methylated bases
myDiff <- getMethylDiff(myDiff,
qvalue = 0.05,
difference = 10)
head(getData(myDiff)$meth.diff)
## [1] -13.38261 -12.50014 16.57960 -10.85679 -11.15066 -11.61221
hist(getData(myDiff)$meth.diff) # methylation differences
# hist before zero towards the left are hypomethylated and those towards the right are hypermethylated
# we see a lower methylation % in AH_F25 samples compared to AA_F25 samples
# get hyper methylated bases
hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")
dim(hyper)
## [1] 6 7
# get hypo methylated bases
hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")
dim(hypo)
## [1] 24 7
par(mfrow=c(1,2))
hist(getData(hyper)$meth.diff)
hist(getData(hypo)$meth.diff)
par(mfrow=c(1,2))
Plots of differentially methylated groups
#heatmaps first
pm <- percMethylation(meth_sub)
# make a dataframe with snp id's, methylation, etc.
sig.in <- as.numeric(row.names(myDiff))
pm.sig <- pm[sig.in,]
# add snp, chr, start, stop
din <- getData(myDiff)[,1:3]
df.out <- cbind(paste(getData(myDiff)$chr,
getData(myDiff)$start, sep=":"),
din, pm.sig)
colnames(df.out) <- c("snp",
colnames(din),
colnames(df.out[5:ncol(df.out)]))
df.out <- (cbind(df.out,getData(myDiff)[,5:7]))
####
# heatmap
####
my_heatmap <- pheatmap(pm.sig,
show_rownames = FALSE) # red is higher methylation
# we can also normalize
ctrmean <- rowMeans(pm.sig[,1:4])
h.norm <- (pm.sig-ctrmean)
my_heatmap <- pheatmap(h.norm,
show_rownames = FALSE)
AH_heat <- my_heatmap
##### Recoloring the heatmaps.
paletteLength <- 50
myColor <- colorRampPalette(c("cyan1", "black", "yellow1"))(paletteLength)
myBreaks <- c(seq(min(h.norm), 0, length.out=ceiling(paletteLength/2) + 1),
seq(max(h.norm)/paletteLength, max(h.norm), length.out=floor(paletteLength/2)))
my_heatmap <- pheatmap(h.norm,
color=myColor,
breaks=myBreaks,
show_rownames = FALSE)
#####
#let's look at methylation of specific snps
####
# convert data frame to long form
head(df.out,3)
## snp chr start end AA_F25_1 AA_F25_2 AA_F25_3 AA_F25_4
## 1965 LS051659.1:1214 LS051659.1 1214 1214 62.50000 69.79866 68.31683 70.14218
## 2251 LS053241.1:414 LS053241.1 414 414 42.34694 41.46341 31.70732 39.47368
## 3451 LS068114.1:6322 LS068114.1 6322 6322 30.35714 28.57143 15.73034 21.60494
## AH_F25_1 AH_F25_2 AH_F25_3 AH_F25_4 pvalue qvalue meth.diff
## 1965 48.33333 49.47368 62.40602 56.06061 5.298836e-05 0.04100001 -13.38261
## 2251 30.18868 24.83660 27.77778 28.57143 1.235214e-04 0.04468703 -12.50014
## 3451 43.44262 33.00000 42.02899 39.13043 1.789182e-05 0.02076583 16.57960
df.plot <- df.out[,c(1,5:12)] %>%
pivot_longer(-snp, values_to = "methylation")
df.plot$group <- substr(df.plot$name,1,2)
head(df.plot)
## # A tibble: 6 x 4
## snp name methylation group
## <fct> <chr> <dbl> <chr>
## 1 LS051659.1:1214 AA_F25_1 62.5 AA
## 2 LS051659.1:1214 AA_F25_2 69.8 AA
## 3 LS051659.1:1214 AA_F25_3 68.3 AA
## 4 LS051659.1:1214 AA_F25_4 70.1 AA
## 5 LS051659.1:1214 AH_F25_1 48.3 AH
## 6 LS051659.1:1214 AH_F25_2 49.5 AH
# looking at snp LS051659.1:1214
AH_snp <- df.plot %>% filter(snp=="LS051659.1:1214") %>%
ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
stat_summary(fun.data = "mean_se", size = 2) +
geom_jitter(width = 0.1, size=3, pch=21, color="black")
ggplotly(AH_snp)
## write bed file for intersection with genome annotation
# write.table(file = "C:\\Epigenetics_data\\Epigenetics_data\\bed_files\\AA_AH_diffmeth.bed",
# data.frame(chr= df.out$chr, start = df.out$start, end = df.out$end),
# row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
AH_snp1 <- df.plot %>% filter(snp=="LS274951.1:141") %>%
ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
stat_summary(fun.data = "mean_se", size = 2) +
geom_jitter(width = 0.1, size=3, pch=21, color="black") +
ylab("Percent Methylation") + xlab("") + theme_bw() +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=15,face="bold"),legend.position = "none") +
ggtitle("Endonuclease III-like protein 1 ") + ylim(0,95)
AH_snp2 <- df.plot %>% filter(snp=="LS068114.1:6322") %>%
ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
stat_summary(fun.data = "mean_se", size = 2) +
geom_jitter(width = 0.1, size=3, pch=21, color="black") +
ylab("") + xlab("") + theme_bw() +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=15,face="bold"),legend.position = "none") +
ggtitle("Deoxyuridine 5’-triphosphate nucleotidohydrolase") + ylim(0,95)
AH_snp3 <- df.plot %>% filter(snp=="LS328870.1:810") %>%
ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
stat_summary(fun.data = "mean_se", size = 2) +
geom_jitter(width = 0.1, size=3, pch=21, color="black") +
ylab("") + xlab("") + theme_bw() +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=15,face="bold")) +
ggtitle("Retrovirus-related Pol polyprotein from transposon 17.6") + ylim(0,95)
Link to genes
/data/popgen/bedtools2/bin/bedtools closest -a /users/a/p/ap1/EcologicalGenomics/myresults/bedfiles_copepod/AA_AH_diffmeth.bed\
-b /data/project_data/epigenetics/GCA_900241095.1_Aton1.0_genomic.fa_annotation_table.bed \
-D b | \
awk '!($10=="-")' > AA_AH_hits.bed # save filename is hits.bed
# count up number of hits
cat AA_AH_hits.bed | wc -l #18
# count number of unique named genes
cat AA_AH_hits.bed | cut -f 8 | sort | uniq -c
find differentially methylated sites between two groups
# subset with reorganize()
meth_sub <- reorganize(meth,
sample.ids =c("AA_F25_1","AA_F25_2","AA_F25_3", "AA_F25_4",
"HH_F25_1","HH_F25_2","HH_F25_3","HH_F25_4"),
treatment = c(0,0,0,0,1,1,1,1), # subsetting to just 8 samples
save.db=FALSE)
# calculate differential methylation
myDiff <- calculateDiffMeth(meth_sub,
overdispersion = "MN",
mc.cores = 1,
suffix = "AA_HH",
adjust = "qvalue",
test = "Chisq")
## two groups detected:
## will calculate methylation difference as the difference of
## treatment (group: 1) - control (group: 0)
# get all differentially methylated bases
myDiff <- getMethylDiff(myDiff,
qvalue = 0.05,
difference = 10)
head(getData(myDiff)$meth.diff)
## [1] 12.36264 -13.76137 31.97324 32.09734 10.66545 10.84294
hist(getData(myDiff)$meth.diff) # methylation differences
# hist before zero towards the left are hypomethylated and those towards the right are hypermethylated
# we see a sightly higher methylation % in HH_F25 samples compared to AA_F25 samples
# get hyper methylated bases
hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")
dim(hyper)
## [1] 66 7
# get hypo methylated bases
hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")
dim(hypo)
## [1] 55 7
par(mfrow=c(1,2))
hist(getData(hyper)$meth.diff)
hist(getData(hypo)$meth.diff)
par(mfrow=c(1,2))
Plots of differentially methylated groups
#heatmaps first
pm <- percMethylation(meth_sub)
# make a dataframe with snp id's, methylation, etc.
sig.in <- as.numeric(row.names(myDiff))
pm.sig <- pm[sig.in,]
# add snp, chr, start, stop
din <- getData(myDiff)[,1:3]
df.out <- cbind(paste(getData(myDiff)$chr,
getData(myDiff)$start, sep=":"),
din, pm.sig)
colnames(df.out) <- c("snp",
colnames(din),
colnames(df.out[5:ncol(df.out)]))
df.out <- (cbind(df.out,getData(myDiff)[,5:7]))
####
# heatmap
####
my_heatmap <- pheatmap(pm.sig,
show_rownames = FALSE) # red is higher methylation
# we can also normalize
ctrmean <- rowMeans(pm.sig[,1:4])
h.norm <- (pm.sig-ctrmean)
my_heatmap <- pheatmap(h.norm,
show_rownames = FALSE)
HH_heat <- my_heatmap
##### Recoloring the heatmaps.
paletteLength <- 50
myColor <- colorRampPalette(c("cyan1", "black", "yellow1"))(paletteLength)
myBreaks <- c(seq(min(h.norm), 0, length.out=ceiling(paletteLength/2) + 1),
seq(max(h.norm)/paletteLength, max(h.norm), length.out=floor(paletteLength/2)))
my_heatmap <- pheatmap(h.norm,
color=myColor,
breaks=myBreaks,
show_rownames = FALSE)
#####
#let's look at methylation of specific snps
####
# convert data frame to long form
head(df.out,3)
## snp chr start end AA_F25_1 AA_F25_2 AA_F25_3
## 124 LS042748.1:7742 LS042748.1 7742 7742 6.122449 8.139535 10.526316
## 262 LS043541.1:2223 LS043541.1 2223 2223 76.800000 77.500000 79.591837
## 393 LS043710.1:571 LS043710.1 571 571 32.258065 17.391304 6.569343
## AA_F25_4 HH_F25_1 HH_F25_2 HH_F25_3 HH_F25_4 pvalue qvalue
## 124 9.677419 19.60784 20.00000 29.72973 14.28571 0.0003768805 0.03738142
## 262 75.806452 75.67568 61.68224 66.21622 48.38710 0.0003671386 0.03721519
## 393 10.650888 29.16667 51.51515 59.09091 24.13793 0.0004571181 0.04267285
## meth.diff
## 124 12.36264
## 262 -13.76137
## 393 31.97324
df.plot <- df.out[,c(1,5:12)] %>%
pivot_longer(-snp, values_to = "methylation")
df.plot$group <- substr(df.plot$name,1,2)
head(df.plot)
## # A tibble: 6 x 4
## snp name methylation group
## <fct> <chr> <dbl> <chr>
## 1 LS042748.1:7742 AA_F25_1 6.12 AA
## 2 LS042748.1:7742 AA_F25_2 8.14 AA
## 3 LS042748.1:7742 AA_F25_3 10.5 AA
## 4 LS042748.1:7742 AA_F25_4 9.68 AA
## 5 LS042748.1:7742 HH_F25_1 19.6 HH
## 6 LS042748.1:7742 HH_F25_2 20 HH
# looking at snp LS042748.1:7742
HH_snp <- df.plot %>% filter(snp=="LS219431.1:243") %>%
ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
stat_summary(fun.data = "mean_se", size = 2) +
geom_jitter(width = 0.1, size=3, pch=21, color="black")
ggplotly(HH_snp)
# write bed file for intersection with genome annotation
# write.table(file = "C:\\Epigenetics_data\\Epigenetics_data\\bed_files\\AA_HH_diffmeth.bed",
# data.frame(chr= df.out$chr, start = df.out$start, end = df.out$end),
# row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
# summary table
df.plot_filter1 <- df.plot %>% filter(group=="AA") %>%
group_by(snp) %>%
summarize(avgMeth = mean(methylation, na.rm = TRUE))
df.plot_filter1$Group <- "AA"
df.plot_filter2 <- df.plot %>% filter(group=="HH") %>%
group_by(snp) %>%
summarize(avgMeth = mean(methylation, na.rm = TRUE))
df.plot_filter2$Group <- "HH"
df.plot_filtered <- merge(df.plot_filter1,df.plot_filter2, by="snp")
df.plot_filtered <- df.plot_filtered %>% arrange(desc(avgMeth.x))
head(df.plot_filtered) # the highest average methylated snps
## snp avgMeth.x Group.x avgMeth.y Group.y
## 1 LS219431.1:243 92.83155 AA 81.34244 HH
## 2 LS239438.1:1057 90.13634 AA 77.00420 HH
## 3 LS311600.1:407 89.29486 AA 76.59789 HH
## 4 LS346027.1:369 88.66368 AA 77.91239 HH
## 5 LS058076.1:184 85.31733 AA 75.20231 HH
## 6 LS313322.1:157 84.10741 AA 67.88553 HH
HH_snp1 <- df.plot %>% filter(snp=="LS049748.1:4514") %>%
ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
stat_summary(fun.data = "mean_se", size = 2) +
geom_jitter(width = 0.1, size=3, pch=21, color="black") +
ylab("Percent Methylation") + xlab("") + theme_bw() +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=15,face="bold"),legend.position = "none") +
ggtitle("STE20-like serine/threonine-protein kinase") + ylim(0,95)
HH_snp2 <- df.plot %>% filter(snp=="LS051669.1:9015") %>%
ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
stat_summary(fun.data = "mean_se", size = 2) +
geom_jitter(width = 0.1, size=3, pch=21, color="black") +
ylab("") + xlab("Treatment") + theme_bw() +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=15,face="bold"),legend.position = "none") +
ggtitle("Probable RNA-directed DNA polymerase from transposon BS") + ylim(0,95)
HH_snp3 <- df.plot %>% filter(snp=="LS070286.1:6277") %>%
ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
stat_summary(fun.data = "mean_se", size = 2) +
geom_jitter(width = 0.1, size=3, pch=21, color="black") +
ylab("") + xlab("") + theme_bw() +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=15,face="bold")) +
ggtitle("Indolethylamine N-methyltransferase") + ylim(0,95)
AH_heat
HH_heat
(AH_snp1 + AH_snp2 + AH_snp3) / (HH_snp1 + HH_snp2 + HH_snp3) + plot_annotation(tag_levels = "A")
## Warning: Removed 2 rows containing missing values (geom_point).
Link to genes
/data/popgen/bedtools2/bin/bedtools closest -a /users/a/p/ap1/EcologicalGenomics/myresults/bedfiles_copepod/AA_HH_diffmeth.bed\
-b /data/project_data/epigenetics/GCA_900241095.1_Aton1.0_genomic.fa_annotation_table.bed \
-D b | \
awk '!($10=="-")' > AA_HH_hits.bed # save filename is hits.bed
# count up number of hits
cat AA_HH_hits.bed | wc -l #34
# count number of unique named genes
cat AA_HH_hits.bed | cut -f 8 | sort | uniq -c
cat AA_HH_hits.bed | cut -f 10-12 | head
# getting only the unique hits as a bed file
awk '!seen[$8]++' AA_HH_hits.bed > AA_HH_uniqueHits.bed # save filename is hits.bed
AA_AH_hits <- read.csv("C:\\Epigenetics_data\\Epigenetics_data\\bed_files/AA_AH_hits.bed", sep="\t", header = FALSE)
knitr::kable(AA_AH_hits[,c(1:2,10:13,16:18)], format="pandoc")
| V1 | V2 | V10 | V11 | V12 | V13 | V16 | V17 | V18 |
|---|---|---|---|---|---|---|---|---|
| LS068114.1 | 6322 | O30931 | Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} | dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; | PF00692; | F:dUTP diphosphatase activity; F:magnesium ion binding; | P:dUMP biosynthetic process; P:dUTP catabolic process; | 0 |
| LS108947.1 | 32 | . | . | . | . | . | . | -1 |
| LS109091.1 | 3297 | . | . | . | . | . | . | -1 |
| LS133954.1 | 2576 | . | . | . | . | . | . | -1 |
| LS138960.1 | 1140 | . | . | . | . | . | . | -1 |
| LS145990.1 | 2562 | . | . | . | . | . | . | -1 |
| LS166448.1 | 985 | . | . | . | . | . | . | -1 |
| LS208810.1 | 1111 | . | . | . | . | . | . | -1 |
| LS226718.1 | 949 | . | . | . | . | . | . | -1 |
| LS234884.1 | 1619 | . | . | . | . | . | . | -1 |
| LS246311.1 | 346 | . | . | . | . | . | . | -1 |
| LS274951.1 | 141 | A7M7B9 A0ZT94 F1NQP6 | Endonuclease III-like protein 1 {ECO:0000255|HAMAP-Rule:MF_03183} | Bifunctional DNA N-glycosylase/DNA-(apurinic or apyrimidinic site) lyase {ECO:0000255|HAMAP-Rule:MF_03183}; | PF00633;PF00730; | F:4 iron, 4 sulfur cluster binding; F:class I DNA-(apurinic or apyrimidinic site) endonuclease activity; F:DNA-(apurinic or apyrimidinic site) endonuclease activity; F:double-stranded DNA binding; F:metal ion binding; F:oxidized pyrimidine nucleobase lesion DNA N-glycosylase activity; | P:base-excision repair, AP site formation; P:nucleotide-excision repair, DNA incision, 5’-to lesion; | 0 |
| LS283868.1 | 874 | . | . | . | . | . | . | -1 |
| LS328870.1 | 810 | P04323 | Retrovirus-related Pol polyprotein from transposon 17.6 | PF17921;PF17917;PF00078; | F:aspartic-type endopeptidase activity; F:endonuclease activity; F:nucleic acid binding; F:RNA-directed DNA polymerase activity; | P:DNA integration; | 0 | |
| LS336361.1 | 132 | . | . | . | . | . | . | -1 |
| LS339920.1 | 871 | . | . | . | . | . | . | -1 |
| LS364758.1 | 572 | . | . | . | . | . | . | -1 |
| LS377801.1 | 203 | . | . | . | . | . | . | -1 |
AA_AH_unique_hits <- read.csv("C:\\Epigenetics_data\\Epigenetics_data\\bed_files/AA_AH_uniqueHits.bed", sep="\t", header = FALSE)
# only unique
knitr::kable(AA_AH_unique_hits[unique(AA_AH_unique_hits$V10),c(1:2,10:13,16:18)],
format="pandoc")
| V1 | V2 | V10 | V11 | V12 | V13 | V16 | V17 | V18 | |
|---|---|---|---|---|---|---|---|---|---|
| 3 | LS274951.1 | 141 | A7M7B9 A0ZT94 F1NQP6 | Endonuclease III-like protein 1 {ECO:0000255|HAMAP-Rule:MF_03183} | Bifunctional DNA N-glycosylase/DNA-(apurinic or apyrimidinic site) lyase {ECO:0000255|HAMAP-Rule:MF_03183}; | PF00633;PF00730; | F:4 iron, 4 sulfur cluster binding; F:class I DNA-(apurinic or apyrimidinic site) endonuclease activity; F:DNA-(apurinic or apyrimidinic site) endonuclease activity; F:double-stranded DNA binding; F:metal ion binding; F:oxidized pyrimidine nucleobase lesion DNA N-glycosylase activity; | P:base-excision repair, AP site formation; P:nucleotide-excision repair, DNA incision, 5’-to lesion; | 0 |
| 1 | LS068114.1 | 6322 | O30931 | Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} | dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; | PF00692; | F:dUTP diphosphatase activity; F:magnesium ion binding; | P:dUMP biosynthetic process; P:dUTP catabolic process; | 0 |
| 2 | LS108947.1 | 32 | . | . | . | . | . | . | -1 |
| 4 | LS328870.1 | 810 | P04323 | Retrovirus-related Pol polyprotein from transposon 17.6 | PF17921;PF17917;PF00078; | F:aspartic-type endopeptidase activity; F:endonuclease activity; F:nucleic acid binding; F:RNA-directed DNA polymerase activity; | P:DNA integration; | 0 |
AA_HH_hits <- read.csv("C:\\Epigenetics_data\\Epigenetics_data\\bed_files/AA_HH_hits.bed", sep="\t", header = FALSE)
knitr::kable(AA_HH_hits[,c(1:2,10:13,16:18)], format="pandoc")
| V1 | V2 | V10 | V11 | V12 | V13 | V16 | V17 | V18 |
|---|---|---|---|---|---|---|---|---|
| LS042748.1 | 7742 | Q91V01 B1B362 O35131 Q8BNH6 | Lysophospholipid acyltransferase 5 | 1-acylglycerophosphocholine O-acyltransferase;1-acylglycerophosphoethanolamine O-acyltransferase;1-acylglycerophosphoserine O-acyltransferase;Lysophosphatidylcholine acyltransferase;Lysophosphatidylcholine acyltransferase 3;Lysophosphatidylethanolamine acyltransferase;Lysophosphatidylserine acyltransferase;Membrane-bound O-acyltransferase domain-containing protein 5; | PF03062; | F:1-acylglycerophosphocholine O-acyltransferase activity; | P:phospholipid biosynthetic process; P:regulation of plasma lipoprotein particle levels; | 0 |
| LS042748.1 | 7742 | Q91V01 B1B362 O35131 Q8BNH6 | Lysophospholipid acyltransferase 5 | 1-acylglycerophosphocholine O-acyltransferase;1-acylglycerophosphoethanolamine O-acyltransferase;1-acylglycerophosphoserine O-acyltransferase;Lysophosphatidylcholine acyltransferase;Lysophosphatidylcholine acyltransferase 3;Lysophosphatidylethanolamine acyltransferase;Lysophosphatidylserine acyltransferase;Membrane-bound O-acyltransferase domain-containing protein 5; | PF03062; | F:1-acylglycerophosphocholine O-acyltransferase activity; | P:phospholipid biosynthetic process; P:regulation of plasma lipoprotein particle levels; | 0 |
| LS043541.1 | 2223 | Q87040 | Pro-Pol polyprotein | Pr125Pol; | PF17921;PF00075;PF17919;PF00665;PF00078;PF18103;PF03539; | F:aspartic-type endopeptidase activity; F:DNA-directed DNA polymerase activity; F:metal ion binding; F:RNA binding; F:RNA-directed DNA polymerase activity; F:RNA-DNA hybrid ribonuclease activity; | P:DNA integration; P:DNA recombination; P:establishment of integrated proviral latency; P:viral entry into host cell; P:viral genome integration into host DNA; P:viral penetration into host nucleus; | 0 |
| LS043710.1 | 571 | Q9HDB9 | Endogenous retrovirus group K member 5 Gag polyprotein | HERV-K(II) Gag protein;HERV-K_3q12.3 provirus ancestral Gag polyprotein; | PF02337;PF00607;PF00098; | F:nucleic acid binding; F:structural molecule activity; F:zinc ion binding; | P:viral process; | 0 |
| LS046185.1 | 8259 | O73798 | Insulin-like growth factor 1 receptor | PF00757;PF07714;PF01030; | F:ATP binding; F:insulin receptor substrate binding; F:insulin-like growth factor binding; F:insulin-like growth factor-activated receptor activity; F:metal ion binding; F:phosphatidylinositol 3-kinase binding; F:protein tyrosine kinase activity; | P:insulin-like growth factor receptor signaling pathway; P:multicellular organism development; P:oocyte maturation; P:protein autophosphorylation; P:protein tetramerization; | 0 | |
| LS049748.1 | 4514 | O55092 | STE20-like serine/threonine-protein kinase | STE20-related serine/threonine-protein kinase; | PF00069;PF12474; | F:ATP binding; F:protein homodimerization activity; F:protein serine/threonine kinase activity; | P:apoptotic process; P:protein autophosphorylation; P:regulation of cell migration; P:regulation of focal adhesion assembly; | 0 |
| LS051669.1 | 9015 | Q95SX7 | Probable RNA-directed DNA polymerase from transposon BS | Reverse transcriptase; | PF14529;PF00078; | F:RNA-directed DNA polymerase activity; | P:transposition, DNA-mediated; | 0 |
| LS051669.1 | 9015 | Q95SX7 | Probable RNA-directed DNA polymerase from transposon BS | Reverse transcriptase; | PF14529;PF00078; | F:RNA-directed DNA polymerase activity; | P:transposition, DNA-mediated; | 0 |
| LS051669.1 | 9015 | Q95SX7 | Probable RNA-directed DNA polymerase from transposon BS | Reverse transcriptase; | PF14529;PF00078; | F:RNA-directed DNA polymerase activity; | P:transposition, DNA-mediated; | 0 |
| LS051734.1 | 5459 | O30931 | Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} | dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; | PF00692; | F:dUTP diphosphatase activity; F:magnesium ion binding; | P:dUMP biosynthetic process; P:dUTP catabolic process; | 0 |
| LS055310.1 | 71 | . | . | . | . | . | . | -1 |
| LS056368.1 | 228 | . | . | . | . | . | . | -1 |
| LS058076.1 | 184 | . | . | . | . | . | . | -1 |
| LS062329.1 | 6106 | Q0P5D3 | G1/S-specific cyclin-D2 | PF02984;PF00134; | F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; | P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; | 1368 | |
| LS062329.1 | 6106 | Q0P5D3 | G1/S-specific cyclin-D2 | PF02984;PF00134; | F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; | P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; | 1368 | |
| LS062329.1 | 6147 | Q0P5D3 | G1/S-specific cyclin-D2 | PF02984;PF00134; | F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; | P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; | 1327 | |
| LS062329.1 | 6147 | Q0P5D3 | G1/S-specific cyclin-D2 | PF02984;PF00134; | F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; | P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; | 1327 | |
| LS064520.1 | 7213 | . | . | . | . | . | . | -1 |
| LS064920.1 | 4017 | . | . | . | . | . | . | -1 |
| LS064920.1 | 4018 | . | . | . | . | . | . | -1 |
| LS068114.1 | 6322 | O30931 | Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} | dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; | PF00692; | F:dUTP diphosphatase activity; F:magnesium ion binding; | P:dUMP biosynthetic process; P:dUTP catabolic process; | 0 |
| LS070286.1 | 6277 | O97972 | Indolethylamine N-methyltransferase | Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; | PF01234; | F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; | P:amine metabolic process; P:methylation; P:response to toxic substance; | -1631 |
| LS070286.1 | 6277 | O97972 | Indolethylamine N-methyltransferase | Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; | PF01234; | F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; | P:amine metabolic process; P:methylation; P:response to toxic substance; | -1631 |
| LS070286.1 | 6280 | O97972 | Indolethylamine N-methyltransferase | Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; | PF01234; | F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; | P:amine metabolic process; P:methylation; P:response to toxic substance; | -1634 |
| LS070286.1 | 6280 | O97972 | Indolethylamine N-methyltransferase | Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; | PF01234; | F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; | P:amine metabolic process; P:methylation; P:response to toxic substance; | -1634 |
| LS075862.1 | 65 | . | . | . | . | . | . | -1 |
| LS082036.1 | 4666 | . | . | . | . | . | . | -1 |
| LS082036.1 | 4896 | . | . | . | . | . | . | -1 |
| LS091612.1 | 430 | . | . | . | . | . | . | -1 |
| LS106615.1 | 2104 | . | . | . | . | . | . | -1 |
| LS108711.1 | 492 | . | . | . | . | . | . | -1 |
| LS108711.1 | 666 | . | . | . | . | . | . | -1 |
| LS109311.1 | 448 | . | . | . | . | . | . | -1 |
| LS109311.1 | 449 | . | . | . | . | . | . | -1 |
| LS109311.1 | 456 | . | . | . | . | . | . | -1 |
| LS109311.1 | 457 | . | . | . | . | . | . | -1 |
| LS109311.1 | 466 | . | . | . | . | . | . | -1 |
| LS109311.1 | 467 | . | . | . | . | . | . | -1 |
| LS138960.1 | 1421 | . | . | . | . | . | . | -1 |
| LS138960.1 | 1936 | . | . | . | . | . | . | -1 |
| LS138960.1 | 2683 | . | . | . | . | . | . | -1 |
| LS143812.1 | 50 | . | . | . | . | . | . | -1 |
| LS148351.1 | 2003 | P04323 | Retrovirus-related Pol polyprotein from transposon 17.6 | PF17921;PF17917;PF00078; | F:aspartic-type endopeptidase activity; F:endonuclease activity; F:nucleic acid binding; F:RNA-directed DNA polymerase activity; | P:DNA integration; | -2 | |
| LS156187.1 | 2187 | . | . | . | . | . | . | -1 |
| LS156187.1 | 2192 | . | . | . | . | . | . | -1 |
| LS156187.1 | 2194 | . | . | . | . | . | . | -1 |
| LS156187.1 | 2203 | . | . | . | . | . | . | -1 |
| LS156187.1 | 2204 | . | . | . | . | . | . | -1 |
| LS156187.1 | 2371 | . | . | . | . | . | . | -1 |
| LS166448.1 | 985 | . | . | . | . | . | . | -1 |
| LS166742.1 | 369 | . | . | . | . | . | . | -1 |
| LS175657.1 | 327 | . | . | . | . | . | . | -1 |
| LS175657.1 | 336 | . | . | . | . | . | . | -1 |
| LS186826.1 | 68 | . | . | . | . | . | . | -1 |
| LS190825.1 | 1484 | . | . | . | . | . | . | -1 |
| LS194438.1 | 77 | . | . | . | . | . | . | -1 |
| LS196635.1 | 188 | . | . | . | . | . | . | -1 |
| LS197389.1 | 867 | . | . | . | . | . | . | -1 |
| LS203364.1 | 253 | Q8I7P9 | Retrovirus-related Pol polyprotein from transposon opus | PF17921;PF17917;PF00665;PF00078; | F:endonuclease activity; F:nucleic acid binding; F:peptidase activity; F:RNA-directed DNA polymerase activity; | P:DNA integration; P:transposition, DNA-mediated; | 353 | |
| LS216243.1 | 1374 | . | . | . | . | . | . | -1 |
| LS219679.1 | 194 | . | . | . | . | . | . | -1 |
| LS226718.1 | 959 | . | . | . | . | . | . | -1 |
| LS226718.1 | 1014 | . | . | . | . | . | . | -1 |
| LS226733.1 | 658 | B9L823 | Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} | dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; | PF00692; | F:dUTP diphosphatase activity; F:magnesium ion binding; | P:dUMP biosynthetic process; P:dUTP catabolic process; | 0 |
| LS226733.1 | 658 | Q9J5G5 | Deoxyuridine 5’-triphosphate nucleotidohydrolase | dUTP pyrophosphatase; | PF00692; | F:dUTP diphosphatase activity; F:magnesium ion binding; | P:dUMP biosynthetic process; P:dUTP catabolic process; | 0 |
| LS231127.1 | 1543 | . | . | . | . | . | . | -1 |
| LS231976.1 | 467 | . | . | . | . | . | . | -1 |
| LS231976.1 | 474 | . | . | . | . | . | . | -1 |
| LS232179.1 | 598 | . | . | . | . | . | . | -1 |
| LS239438.1 | 1057 | . | . | . | . | . | . | -1 |
| LS240607.1 | 71 | . | . | . | . | . | . | -1 |
| LS241916.1 | 1021 | A4FUB7 | Gypsy retrotransposon integrase-like protein 1 | PF17921; | F:nucleic acid binding; | P:DNA integration; | 0 | |
| LS243346.1 | 778 | A4FUB7 | Gypsy retrotransposon integrase-like protein 1 | PF17921; | F:nucleic acid binding; | P:DNA integration; | 5 | |
| LS243346.1 | 781 | A4FUB7 | Gypsy retrotransposon integrase-like protein 1 | PF17921; | F:nucleic acid binding; | P:DNA integration; | 2 | |
| LS244160.1 | 273 | . | . | . | . | . | . | -1 |
| LS255237.1 | 1275 | . | . | . | . | . | . | -1 |
| LS279743.1 | 1105 | . | . | . | . | . | . | -1 |
| LS279852.1 | 203 | . | . | . | . | . | . | -1 |
| LS283678.1 | 900 | . | . | . | . | . | . | -1 |
| LS290113.1 | 1093 | . | . | . | . | . | . | -1 |
| LS291054.1 | 731 | . | . | . | . | . | . | -1 |
| LS294040.1 | 224 | . | . | . | . | . | . | -1 |
| LS311600.1 | 407 | . | . | . | . | . | . | -1 |
| LS315617.1 | 700 | . | . | . | . | . | . | -1 |
| LS316650.1 | 231 | . | . | . | . | . | . | -1 |
| LS320205.1 | 231 | . | . | . | . | . | . | -1 |
| LS320226.1 | 957 | . | . | . | . | . | . | -1 |
| LS335360.1 | 374 | . | . | . | . | . | . | -1 |
| LS338923.1 | 718 | . | . | . | . | . | . | -1 |
| LS338923.1 | 719 | . | . | . | . | . | . | -1 |
| LS338923.1 | 735 | . | . | . | . | . | . | -1 |
| LS348101.1 | 994 | . | . | . | . | . | . | -1 |
| LS351877.1 | 271 | . | . | . | . | . | . | -1 |
| LS359373.1 | 273 | . | . | . | . | . | . | -1 |
AA_HH_unique_hits <- read.csv("C:\\Epigenetics_data\\Epigenetics_data\\bed_files/AA_HH_uniqueHits.bed",
sep="\t", header = FALSE)
# only unique
knitr::kable(AA_HH_unique_hits[unique(AA_HH_unique_hits$V10),c(1:2,10:13,16:18)],
format="pandoc")
| V1 | V2 | V10 | V11 | V12 | V13 | V16 | V17 | V18 | |
|---|---|---|---|---|---|---|---|---|---|
| 12 | LS062329.1 | 6106 | Q0P5D3 | G1/S-specific cyclin-D2 | PF02984;PF00134; | F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; | P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; | 1368 | |
| 10 | LS051734.1 | 5459 | O30931 | Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} | dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; | PF00692; | F:dUTP diphosphatase activity; F:magnesium ion binding; | P:dUMP biosynthetic process; P:dUTP catabolic process; | 0 |
| 14 | LS070286.1 | 6277 | O97972 | Indolethylamine N-methyltransferase | Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; | PF01234; | F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; | P:amine metabolic process; P:methylation; P:response to toxic substance; | -1631 |
| 6 | LS049748.1 | 4514 | O55092 | STE20-like serine/threonine-protein kinase | STE20-related serine/threonine-protein kinase; | PF00069;PF12474; | F:ATP binding; F:protein homodimerization activity; F:protein serine/threonine kinase activity; | P:apoptotic process; P:protein autophosphorylation; P:regulation of cell migration; P:regulation of focal adhesion assembly; | 0 |
| 5 | LS046185.1 | 8259 | O73798 | Insulin-like growth factor 1 receptor | PF00757;PF07714;PF01030; | F:ATP binding; F:insulin receptor substrate binding; F:insulin-like growth factor binding; F:insulin-like growth factor-activated receptor activity; F:metal ion binding; F:phosphatidylinositol 3-kinase binding; F:protein tyrosine kinase activity; | P:insulin-like growth factor receptor signaling pathway; P:multicellular organism development; P:oocyte maturation; P:protein autophosphorylation; P:protein tetramerization; | 0 | |
| 13 | LS062329.1 | 6106 | Q0P5D3 | G1/S-specific cyclin-D2 | PF02984;PF00134; | F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; | P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; | 1368 | |
| 4 | LS043710.1 | 571 | Q9HDB9 | Endogenous retrovirus group K member 5 Gag polyprotein | HERV-K(II) Gag protein;HERV-K_3q12.3 provirus ancestral Gag polyprotein; | PF02337;PF00607;PF00098; | F:nucleic acid binding; F:structural molecule activity; F:zinc ion binding; | P:viral process; | 0 |
| 1 | LS042748.1 | 7742 | Q91V01 B1B362 O35131 Q8BNH6 | Lysophospholipid acyltransferase 5 | 1-acylglycerophosphocholine O-acyltransferase;1-acylglycerophosphoethanolamine O-acyltransferase;1-acylglycerophosphoserine O-acyltransferase;Lysophosphatidylcholine acyltransferase;Lysophosphatidylcholine acyltransferase 3;Lysophosphatidylethanolamine acyltransferase;Lysophosphatidylserine acyltransferase;Membrane-bound O-acyltransferase domain-containing protein 5; | PF03062; | F:1-acylglycerophosphocholine O-acyltransferase activity; | P:phospholipid biosynthetic process; P:regulation of plasma lipoprotein particle levels; | 0 |
| 9 | LS051669.1 | 9015 | Q95SX7 | Probable RNA-directed DNA polymerase from transposon BS | Reverse transcriptase; | PF14529;PF00078; | F:RNA-directed DNA polymerase activity; | P:transposition, DNA-mediated; | 0 |
| 7 | LS051669.1 | 9015 | Q95SX7 | Probable RNA-directed DNA polymerase from transposon BS | Reverse transcriptase; | PF14529;PF00078; | F:RNA-directed DNA polymerase activity; | P:transposition, DNA-mediated; | 0 |
| 8 | LS051669.1 | 9015 | Q95SX7 | Probable RNA-directed DNA polymerase from transposon BS | Reverse transcriptase; | PF14529;PF00078; | F:RNA-directed DNA polymerase activity; | P:transposition, DNA-mediated; | 0 |
| 11 | LS055310.1 | 71 | . | . | . | . | . | . | -1 |
| 3 | LS043541.1 | 2223 | Q87040 | Pro-Pol polyprotein | Pr125Pol; | PF17921;PF00075;PF17919;PF00665;PF00078;PF18103;PF03539; | F:aspartic-type endopeptidase activity; F:DNA-directed DNA polymerase activity; F:metal ion binding; F:RNA binding; F:RNA-directed DNA polymerase activity; F:RNA-DNA hybrid ribonuclease activity; | P:DNA integration; P:DNA recombination; P:establishment of integrated proviral latency; P:viral entry into host cell; P:viral genome integration into host DNA; P:viral penetration into host nucleus; | 0 |
| 15 | LS070286.1 | 6277 | O97972 | Indolethylamine N-methyltransferase | Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; | PF01234; | F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; | P:amine metabolic process; P:methylation; P:response to toxic substance; | -1631 |
| 2 | LS042748.1 | 7742 | Q91V01 B1B362 O35131 Q8BNH6 | Lysophospholipid acyltransferase 5 | 1-acylglycerophosphocholine O-acyltransferase;1-acylglycerophosphoethanolamine O-acyltransferase;1-acylglycerophosphoserine O-acyltransferase;Lysophosphatidylcholine acyltransferase;Lysophosphatidylcholine acyltransferase 3;Lysophosphatidylethanolamine acyltransferase;Lysophosphatidylserine acyltransferase;Membrane-bound O-acyltransferase domain-containing protein 5; | PF03062; | F:1-acylglycerophosphocholine O-acyltransferase activity; | P:phospholipid biosynthetic process; P:regulation of plasma lipoprotein particle levels; | 0 |
SCP commandline
scp path/on/the/system ap1@pbio381.edu:~/
vim AA_HH_hits.bed
:set nowrap # to see all the columns
awk '!seen[$8]++' hits.bed | cut -f11- > uniqueHitsNames.bed
that code will remove duplicate lines and only print columns with the name of the gene and the annotation info
try adding sep="\t"
also. read.table has issues sometimes with spaces. read.csv can be better